Tree-Based Methods¶

Tree-based methods stratify or segment the predictor space into a number of simple regions (James et al.). As the splitting rules used to make these decision regions can be summarised in a tree structure, these approaches are called decision trees.

To make a prediction for an observation, we use the mean or mode of the training observations in the region to which it belongs.

Like SVMs, these methods can be used for both classification and regression.


So why not just stick to decision trees? Although they are useful for interpretation, they are not as competitive as other supervised learning methods, so we will also talk about bagging, random forests, and boosting, which all rely on producing multiple trees and combining them into a single prediction. This typically improves accuracy considerably, but at the expense of interpretability.


Introduction¶

The "palmer penguins" dataset (Gorman et al.) contains data for 344 penguins from 3 different species across 3 islands in the Palmer Archipelago, Antarctica.


Artwork by @allison_horst

In [5]:
Image(os.path.join(image_path, "lter_penguins.png"))
Out[5]:
In [6]:
display(penguins.head())
display(penguins.info())
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
None
In [7]:
sns.pairplot(penguins, hue="species", markers=shape_dict, palette=col_dict)
plt.show()

Decision Trees¶

A decision tree breaks the data down by asking a series of questions in order to separate samples into their classes.

Scikit-Learn uses the Classification And Regression Tree (CART) algorithm to produce binary trees, meaning internal nodes always have exactly two children.
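We can check the binary-tree claim directly on a fitted tree (a minimal sketch using the toy iris data rather than the penguins; the `tree_.children_left`/`children_right` attributes are scikit-learn's internal node arrays):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=42).fit(X, y)

t = clf.tree_
# In scikit-learn, children_left/children_right are -1 only at leaves,
# so every node is either a leaf or an internal node with both children.
is_binary = all((t.children_left[i] == -1) == (t.children_right[i] == -1)
                for i in range(t.node_count))
print(is_binary)  # True
```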

Terminology (James et al.)

  • The regions $R_1$ and $R_2$ are known as terminal nodes or leaves of the tree.
  • The points where the predictor space is split are known as the internal nodes.
  • The segments of the trees that connect the nodes are branches.

Notes

  • Other algorithms, such as ID3, can produce nodes with more than two children.
  • Trees are typically drawn upside down, so the leaves are at the bottom of the tree.
  • "scikit-learn uses an optimised version of the CART algorithm; however, scikit-learn implementation does not support categorical variables for now." https://scikit-learn.org/stable/modules/tree.html
  • The arrows don't show up on older versions of Scikit-Learn due to a weird interaction with sns, so I need to use plt.style.context("classic"). This has been fixed in versions later than what is currently available on Anaconda.
In [17]:
regions_tree(Xflbl, DT, flbl.columns[:-1], ['Adelie', 'Gentoo'], l_labels, r_labels, tp_labels)

Information Gain¶

An algorithm starts at the tree root and splits the data on the feature that gives the largest information gain. This splitting procedure repeats until all the samples within a given node belong to the same class (Raschka), the maximum depth of the tree is reached, or a split cannot be found that reduces impurity (Géron).

Splitting on information gain relies on calculating the difference between an impurity measure of a parent node and the sum of the impurities of its child nodes; information gain is high when the impurity of the child nodes is low.

Three impurity measures commonly used in binary decision trees are the classification error, Gini impurity, and entropy1.


  1. Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.

Classification Error¶

This is simply the fraction of the training observations in a region that do not belong to the most common class:

$E = 1 - \max_k(\hat p_{mk})$.

$\hat p_{mk}$ here is the proportion of training observations in the $m$th region that are from the $k$th class.


James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.

Notes

  • However, this is rarely used in practice because the Gini index and entropy are more sensitive to node purity; it is not even an option in scikit-learn.
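As a quick illustration with hypothetical class proportions (not taken from the penguins data), the classification error is simply:

```python
import numpy as np

def classification_error(p_mk):
    """Classification error for node m, given the class proportions p_mk."""
    return 1 - np.max(p_mk)

# a node where 80% of observations belong to the most common class:
print(round(classification_error(np.array([0.8, 0.2])), 3))  # 0.2
```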

Gini Impurity¶

Measures the total variance across $K$ classes,

$G = \sum^K_{k=1}\hat p_{mk}(1-\hat p_{mk})$.

It is a measure of node "purity", as a small value indicates a node contains predominantly observations from a single class.


James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.

Notes

  • It takes a small value if all of the $\hat p_{mk}$'s are close to 0 or 1
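A minimal sketch of the Gini calculation, again with hypothetical proportions:

```python
import numpy as np

def gini(p_mk):
    """Gini impurity for node m, given the class proportions p_mk."""
    return np.sum(p_mk * (1 - p_mk))

print(gini(np.array([1.0, 0.0])))  # 0.0 (pure node)
print(gini(np.array([0.5, 0.5])))  # 0.5 (maximally impure for K = 2)
```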
In [20]:
regions_tree(Xflbl, DT, flbl.columns[:-1], ['Adelie', 'Gentoo'], l_labels, r_labels, tp_labels, impurity = True)

Entropy Impurity¶

Entropy is an alternative measurement, given by

$D = -\sum^K_{k=1}\hat p_{mk}\log \hat p_{mk}$.

Entropy will take on a value near 0 if the $\hat p_{mk}$'s are all near 0 or 1, and will therefore take on a small value if the $m$th node is pure.
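A matching sketch for entropy (hypothetical proportions; the natural log is assumed here, since the base only rescales the measure, and $0\log 0$ is taken to be 0):

```python
import numpy as np

def entropy(p_mk):
    """Entropy for node m, given the class proportions p_mk."""
    p = p_mk[p_mk > 0]          # drop zero proportions: 0 log 0 := 0
    return np.sum(p * np.log(1 / p))

print(entropy(np.array([1.0, 0.0])))            # 0.0 for a pure node
print(round(entropy(np.array([0.5, 0.5])), 3))  # ~0.693, i.e. log(2)
```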

In [22]:
regions_tree(Xflbl, DT, flbl.columns[:-1], ['Adelie', 'Gentoo'], l_labels, r_labels, tp_labels, impurity = True)

Pruning¶

[Insert about pruning]

Categorical Variables¶

Do not use label encoding with DecisionTreeClassifier() if your categorical data is not ordinal; you'll end up with splits that do not make sense, as the data will be treated as numericWeb.

Using a OneHotEncoder is currently the only valid way; it allows arbitrary splits that do not depend on the label ordering, but is computationally expensive. It can also deteriorate the performance of decision trees, as it leads to sparse features, which can distort feature importancesWeb.
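A sketch of the one-hot approach (the toy frame and its column names are made up for illustration, loosely modelled on the penguins columns):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# hypothetical toy frame with one nominal feature
df = pd.DataFrame({
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe"],
    "body_mass_g": [3750.0, 4500.0, 3300.0, 5000.0],
    "species": [0, 1, 2, 1],
})

# one-hot encode the nominal column, pass the numeric column through
pre = ColumnTransformer([("onehot", OneHotEncoder(), ["island"])],
                        remainder="passthrough")
clf = make_pipeline(pre, DecisionTreeClassifier(random_state=42))
clf.fit(df[["island", "body_mass_g"]], df["species"])
print(clf.predict(df[["island", "body_mass_g"]]))
```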

TODO

  • then what do we do?

Web. https://stackoverflow.com/questions/38108832/passing-categorical-data-to-sklearn-decision-tree

Notes

  • "scikit-learn uses an optimised version of the CART algorithm; however, scikit-learn implementation does not support categorical variables for now." https://scikit-learn.org/stable/modules/tree.html

Advantages¶

  • Trees are very easy to explain

  • Decision trees may more closely mirror human decision-making than the regression and classification approaches previously discussed.

  • Trees can be displayed graphically, and are easily interpreted.

  • Trees can easily handle qualitative predictors without the need to create dummy variables.


James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.

Assessing Feature Importance¶

Decision trees allow us to assess the importance of each feature for classifying the data. The importance (or Gini importance) of a feature is the normalized total reduction of the criterion (e.g. Gini) brought by that feature.


https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

Notes

  • Lets fit all the penguins on the full data (without categories) and see what features are used to split the data and where they are in the tree.
  • For bagging regression trees we can use the RSS.
In [53]:
tree_feat_import(DT, X, y, feat_names,
                 'Feature Importances for Classifying Adelie and Gentoo Penguins')
bill_length_mm: 0.015
bill_depth_mm: 0.029
flipper_length_mm: 0.956
body_mass_g: 0.000
total: 1.0
In [54]:
fig, axes = plt.subplots(figsize=(10,5))
with plt.style.context("classic"):
    tree.plot_tree(DT,
                   feature_names=penguins_bin.drop("species",axis=1).columns, 
                   class_names=['Adelie', 'Gentoo'],
                   filled = True)
    plt.show()

Multiclass¶

Tree-based classifiers are inherently multiclass...

[Insert about multiclass]

In [56]:
X = penguins_cont.drop("species",axis=1).values
y = penguins_cont[["species"]].replace({'Adelie': 0, 'Gentoo': 1, "Chinstrap":2}).values.flatten()

DT = DecisionTreeClassifier(criterion='gini',
                            max_depth = None,
                            random_state=42)

DT.fit(X, y)

feat_names = penguins_cont.drop("species",axis=1).columns

fig, axes = plt.subplots(figsize=(10,5))
with plt.style.context("classic"):
    tree.plot_tree(DT,
                   feature_names=feat_names, 
                   class_names=['Adelie', 'Gentoo', 'Chinstrap'],
                   filled = True)
    plt.show()
print("['Adelie', 'Gentoo', 'Chinstrap']")
['Adelie', 'Gentoo', 'Chinstrap']

Disadvantages¶

  • Trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches.

  • A small change in the data can cause a large change in the final estimated tree.

However, by aggregating many decision trees, using methods like bagging, random forests, and boosting, the predictive performance of trees can be substantially improved. We introduce these concepts in the next section.


James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.

Rotation of the Data¶

As can be seen from the decision boundary, a decision tree is quite boxy. Furthermore, how the model makes a decision boundary is affected by the rotation of the data (as decision trees create axis-aligned splits).

In [26]:
regions_tree(Xblbd, DT, blbd.columns[:-1], ['Adelie', 'Gentoo'],  [[37, 22],[51.5, 22]], r_labels, tp_labels)
regions_tree(Xblbd, DT2, blbd.columns[:-1], ['Adelie', 'Gentoo'], [[57.5, 19.5],[57.5, 14.5]])

"Classical Approaches" or Trees?¶

Which approach is better really depends on the relationship between the features and the response.

If there is a highly non-linear and complex relationship between the features and the response then decision trees may outperform classical approaches. However if the relationship between the features and the response is well approximated by a linear model, then an approach such as linear regression will likely work well, and will outperform a method such as a regression tree that does not exploit this linear structure.

The relative performances of tree-based and classical approaches can be assessed by estimating the test error, using either cross-validation or the validation set approach.


James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.

Notes

  • Linear regression assumes a model of the form $f(X)=\beta_0+\sum_{j=1}^p X_j\beta_j$,

  • Regression trees assume a model of the form $f(X)=\sum_{m=1}^Mc_m\cdot1_{(X\in R_m)}$, where $R_1, …, R_M$ represent a partition of feature space.

In [27]:
Image(os.path.join("fig_8-7.png"))
Out[27]:

Overfitting¶

A limit on nodes, or tree depth, is often set to avoid overfitting due to a deep tree1.


  1. Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.
In [28]:
DT = DecisionTreeClassifier(criterion='gini',
                            max_depth = None,
                            random_state=42)
regions_tree(Xflbl, DT, flbl.columns[:-1], ['Adelie', 'Gentoo'])

Exercises¶

[THEORY BASED!]

1.

2.

Averaging Methods¶

Averaging methods build several separate estimators and then, as their name suggests, average their predictions.

By reducing the variance these tend to perform better than any single base estimator1.

Averaging methods generally work best when the predictors are as independent as possible; one way of achieving this is to use diverse classifiers. This increases the chance that they each make different types of errors, which in combination improves the overall accuracy2.


  1. Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.
  2. Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. " O'Reilly Media, Inc.".

NOTES

  • High Variance means if we split the training data into two parts at random, and fit a decision tree to both halves, the results that we get could be quite different.
  • Low Variance means a procedure will yield similar results if applied repeatedly to distinct data sets.
  • Decision trees typically suffer from high variance.
  • Linear regression tends to have low variance, if the ratio of n to p is moderately large.

Bagging¶

A bagging classifier is an ensemble of base classifiers, each fit on a random subset of the dataset. Their predictions are then pooled or aggregated to form a final prediction. This reduces the variance of the estimator, so it can be a simple way to reduce overfitting and increase prediction accuracy1.

This is because the variance of the mean $\bar Z$ of $n$ independent observations, $Z_1,...,Z_n$, is given by $\sigma^2/n$; averaging a set of observations reduces variance. We can exploit this by taking many separate training sets, $B$ of them, from the population, building a separate prediction model using each training set, $\hat f^1(x),\hat f^2(x),...,\hat f^B(x)$, and averaging the resulting predictions2:

$\hat f_{avg}(x)=\frac{1}{B}\sum_{b=1}^B\hat f^b(x)$.
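The $\sigma^2/n$ claim is easy to check with a quick simulation (illustrative only; the numbers below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma2, n, B = 4.0, 50, 20_000

# B draws of a single observation vs. B draws of the mean of n observations
single = rng.normal(0.0, np.sqrt(sigma2), size=B)
means = rng.normal(0.0, np.sqrt(sigma2), size=(B, n)).mean(axis=1)

print(single.var())      # close to sigma^2 = 4
print(means.var() * n)   # also close to sigma^2, i.e. Var(mean) ~= sigma^2 / n
```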

However, we generally do not have access to multiple training sets, so instead we can bootstrap by taking repeated samples from a (single) training data set to create $B$ bootstrapped training data sets. We then train our method on the $b$th bootstrapped training set to get $\hat f^{*b}(x)$, and finally average all the predictions to obtain2:

$\hat f_{bag}(x) = \frac{1}{B}\sum_{b=1}^B\hat f^{∗b}(x)$.


  1. https://scikit-learn.org/stable/modules/ensemble.html

  2. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.

Specifically, sampling with replacement is called bagging1, while sampling without replacement is called pasting2.

Pasting is designed to use smaller sample sizes than the training dataset in cases where the training dataset does not fit into memory3.

Both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows instances to be sampled several times for the same predictor3.


  1. Breiman, L. (1996). Bagging predictors. Machine learning, 24(2), 123-140.
  2. Breiman, L. (1999). Pasting small votes for classification in large databases and on-line. Machine learning, 36(1-2), 85-103.
  3. Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. " O'Reilly Media, Inc.".

NOTES

  • If we wanted to use pasting we would just set bootstrap=False.
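A minimal sketch of both schemes with scikit-learn's BaggingClassifier (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=42)

# bagging: each base tree is fit on a bootstrap sample (with replacement)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=42).fit(X, y)

# pasting: sampling without replacement, using a subsample of the data
paste = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                          bootstrap=False, max_samples=0.8,
                          random_state=42).fit(X, y)

print(bag.score(X, y), paste.score(X, y))
```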

In practice, bagging tends to work best with complex models1, so although bagging is useful when applied to regression methods, it is particularly useful for decision trees2.

Bagging has been demonstrated to give impressive improvements in accuracy by combining together hundreds or even thousands of trees into a single procedure.

These trees are grown deep, and are not pruned so each individual tree has high variance, but low bias. Averaging these $B$ trees reduces the variance.

To apply bagging to regression trees, we simply:

  1. construct $B$ regression trees using $B$ bootstrapped training sets,
  2. average the resulting predictions.

To apply bagging to classification trees, we simply:

  1. construct $B$ decision trees using $B$ bootstrapped training sets,
  2. take a majority vote of the resulting predictions.
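The classification recipe above can be sketched by hand (an illustration only; in practice scikit-learn's BaggingClassifier does this for us):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)
B = 25

# 1. construct B classification trees using B bootstrapped training sets
trees = []
for _ in range(B):
    idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# 2. take a majority vote of the resulting predictions
votes = np.stack([t.predict(X) for t in trees])
y_hat = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print((y_hat == y).mean())
```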

  1. https://scikit-learn.org/stable/modules/ensemble.html
  2. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.
  3. Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. " O'Reilly Media, Inc.".

Notes

  • boosting will generally work better with weak models.
  • [Explanation of bias]
  • Have a look at Majority Voting from the last lecture.

Out-of-Bag Error Estimation¶

Using a bagged model means we can get a validation/test error without using cross-validation.

On average, each bagged tree uses around two-thirds of the observations1, REF, so we can predict the out-of-bag (OOB) observations using only the trees that were not fit on them.

With a sufficiently large number of bags, the OOB error is virtually equivalent to leave-one-out cross-validation error, which is convenient when performing bagging on large data sets for which cross-validation would be computationally onerous1.


  1. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.

NOTES

  • we use oob_score = True
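A minimal sketch using oob_score=True (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                        bootstrap=True, oob_score=True,
                        random_state=42).fit(X, y)

# accuracy estimated on the held-out (OOB) observations, no CV needed
print(bag.oob_score_)
```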

Variable Importance Measures¶

Bagging improves prediction accuracy at the expense of interpretability.

However, we can use the feature importance measure discussed previously to get an overall summary of the importance of each predictor across trees.

NOTES

  • Bagging can be used with most classifiers, although you can only assess feature importances if a method provides a feature_importances_.
  • "Recall that one of the advantages of decision trees is the attractive and easily interpreted diagram that results... However, when we bag a large number of trees, it is no longer possible to represent the resulting statistical learning procedure using a single tree, and it is no longer clear which variables are most important to the procedure."1

  1. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.
In [74]:
# get the importances for the features
importances = np.mean([
    tree.feature_importances_ for tree in bag.estimators_
], axis=0)

importances_sd = np.std([
    tree.feature_importances_ for tree in bag.estimators_
], axis=0)

feat_names = penguins_cont.drop("species",axis=1).columns
importances_series = pd.Series(importances,index=feat_names).sort_values(ascending = False)

# plot the important features
importances_series.plot.bar(legend =False, grid=False, yerr=importances_sd)
plt.title('Feature Importances')

plt.xticks(rotation=45,ha='right')
plt.tight_layout()

#plt.savefig('forest_importances.png', dpi=300)
plt.show()

Random Forests¶

Random forests are essentially bagged tree classifiers, but they decorrelate the trees by using a random sample of features each time a split in a tree is considered.

The random forest algorithm can therefore be summarized in four steps1:

  1. Draw a random bootstrap sample of size $n$.
  2. Grow a decision tree from the bootstrap sample. At each node:

    a. Randomly select $d$ features without replacement (typically the square root of the total number of predictors).

    b. Split the node using the feature that provides the best split according to the objective function.

  3. Repeat the steps above $k$ times.

  4. Aggregate the prediction by each tree to assign the class label by majority vote.

By not allowing each split to consider the majority of the available predictors, we ensure the bagged trees look different from each other. If there is a particularly strong set of predictors in the data, then without randomly selecting features the bagged trees will look quite similar to each other and their predictions will be highly correlated. Averaging highly correlated quantities does not lead to as large a reduction in variance as averaging many uncorrelated quantities2.
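The four steps above are what RandomForestClassifier implements; a quick sketch comparing max_features settings (synthetic data for illustration, no claim about which wins on real data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=16, n_informative=4,
                           random_state=42)

# max_features="sqrt" considers a random subset of features at each split;
# max_features=None considers all of them, which reduces the forest to bagging.
for mf in ["sqrt", None]:
    rf = RandomForestClassifier(n_estimators=100, max_features=mf,
                                random_state=42)
    print(mf, round(cross_val_score(rf, X, y, cv=5).mean(), 3))
```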


  1. Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.
  2. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.

NOTES

  • Instead of using a majority vote, as was done in the original publication1, scikit-learn's RandomForestClassifier averages the probabilistic predictions.
  • Remember we cannot use graphviz or sklearn.tree.plot_tree on the whole forest as we did for the trees, as each tree is built differently.
  • If a random forest is built using $d = D$ (all features), then this is simply bagging.
  • Rather than using the bagging method above, we can use one of the inbuilt methods scikit-learn has specifically designed for fitting an ensemble of trees. It's generally quicker, as can be seen below...

TODO

  • okay, maybe not on small data then? Maybe make a graph as data increases using make_classification

  1. Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.
In [92]:
X = penguins_cont.drop("species",axis=1).values
y = penguins_cont[["species"]].replace({'Adelie': 0, 'Gentoo': 1, "Chinstrap":2}).values.flatten()
feat_names = penguins_cont.drop("species",axis=1).columns
class_names=['Adelie', 'Gentoo', 'Chinstrap']

RF = RandomForestClassifier(criterion='gini',
                            n_estimators=100,
                            max_samples=0.8,
                            max_features = 2,
                            random_state=42,
                            n_jobs=-1)

RF.fit(X, y)

plot_trees = 3
fig, axes = plt.subplots(ncols=plot_trees, figsize=(10,5))
with plt.style.context("classic"):
    for i in range(plot_trees):
        plt.sca(axes[i])
        plot_tree(RF.estimators_[i],
                       feature_names=feat_names, 
                       class_names=class_names,
                       filled = True)
    plt.show()

Let's look at how a decision boundary created by a bagged tree could generalise better than a single tree.

TODO

  • do this on training and val data to show how the val data could cross the boundary?
In [22]:
tvf_regions()

Extremely Randomized Trees¶

As averaging methods work best when the predictors are as independent as possible2, we may specifically want the trees in our ensemble to be more independent.

An extra-tree1 is similar to a standard decision tree classifier except that it is more randomised and thus produces more complex trees.

Instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule.

When used in an ensemble, this usually allows the variance of the model to be reduced a bit more, at the expense of a slightly greater increase in bias2.
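A minimal comparison of ExtraTreesClassifier with a random forest (synthetic data for illustration; which wins depends on the data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, random_state=42)

# both ensembles share the same interface; extra-trees additionally
# randomises the split thresholds at each candidate feature
for Model in (RandomForestClassifier, ExtraTreesClassifier):
    clf = Model(n_estimators=100, random_state=42)
    print(Model.__name__, round(cross_val_score(clf, X, y, cv=5).mean(), 3))
```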


  1. Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine learning, 63(1), 3-42.
  2. https://scikit-learn.org/stable/modules/ensemble.html
  3. Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. " O'Reilly Media, Inc.".

As we can see from how just one tree looks, this is a very complex model!

In [31]:
fig, axes = plt.subplots(figsize=(10, 5))
with plt.style.context("classic"):
    tree.plot_tree(ET,
                   feature_names=penguins_cont.columns, 
                   class_names=['Adelie', 'Gentoo'],
                   filled = True)
    plt.show()

Exercises¶

[THEORY BASED!]

1.

2.

Boosting¶

Exercises¶

[THEORY BASED!]

1.

2.

References¶

  1. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.
  2. Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081
  3. Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. " O'Reilly Media, Inc.".
  4. Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.